feat(sinker): Ship 8 — Nv24/Nv42 RGBA + Strategy A RGB→RGBA fan-out#20
feat(sinker): Ship 8 — Nv24/Nv42 RGBA + Strategy A RGB→RGBA fan-out#20
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds RGBA output support for NV24/NV42 and introduces a “Strategy A” optimization to avoid running YUV→RGB math twice when both RGB and RGBA outputs are attached.
Changes:
- Add NV24/NV42 RGBA row converters (scalar + SIMD dispatch) and wire them into
MixedSinker. - Implement Strategy A fan-out (
RGB -> RGBAexpansion) to reuse the RGB kernel output when RGBA is also requested. - Add extensive tests covering NV24/NV42 RGBA behavior and SIMD-vs-scalar equivalence.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/sinker/mixed.rs | Wires NV24/NV42 RGBA buffers, adds Strategy A selection logic, and adds sinker-level tests. |
| src/row/scalar.rs | Adds NV24/NV42 scalar RGBA kernels and the expand_rgb_to_rgba_row helper + tests. |
| src/row/mod.rs | Exposes NV24/NV42 RGBA row APIs with SIMD dispatch. |
| src/row/arch/x86_sse41.rs | Adds SSE4.1 NV24/NV42 RGBA kernels and equivalence tests. |
| src/row/arch/x86_avx2.rs | Adds AVX2 NV24/NV42 RGBA kernels and equivalence tests. |
| src/row/arch/x86_avx512.rs | Adds AVX-512 NV24/NV42 RGBA kernels and equivalence tests. |
| src/row/arch/wasm_simd128.rs | Adds wasm simd128 NV24/NV42 RGBA kernels and equivalence tests. |
| src/row/arch/neon.rs | Adds NEON NV24/NV42 RGBA kernels and equivalence tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// | ||
| /// See [`MixedSinker::<Nv24>::with_rgba`] for the same rationale and | ||
| /// constraints; Nv42 differs only in chroma byte order (V before U). | ||
| /// Returns `Err(RgbaBufferTooShort)` if `buf.len() < width × height × 4`. |
There was a problem hiding this comment.
The Nv42::with_rgba doc currently mentions only RgbaBufferTooShort, but the implementation calls self.frame_bytes(4)?, which can also return GeometryOverflow (e.g., on 32-bit targets). Update the doc comment to match the actual error surface (similar to the Nv24 docs above).
| /// Returns `Err(RgbaBufferTooShort)` if `buf.len() < width × height × 4`. | |
| /// Returns `Err(GeometryOverflow)` if `width × height × 4` cannot be | |
| /// represented on the current target, or `Err(RgbaBufferTooShort)` if | |
| /// `buf.len() < width × height × 4`. |
| /// `out.len() >= width * (if ALPHA { 4 } else { 3 })`. | ||
| #[cfg_attr(not(tarpaulin), inline(always))] | ||
| fn nv24_or_nv42_to_rgb_row_impl<const SWAP_UV: bool>( | ||
| pub(crate) fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>( |
There was a problem hiding this comment.
This shared kernel function appears to be an internal implementation detail (and was previously private). If it’s not used outside row::scalar, consider making it fn (module-private) again to avoid unnecessarily widening the crate-internal API surface and reduce coupling.
| pub(crate) fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>( | |
| fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>( |
| for x in 0..width { | ||
| rgba_out[x * 4] = rgb[x * 3]; | ||
| rgba_out[x * 4 + 1] = rgb[x * 3 + 1]; | ||
| rgba_out[x * 4 + 2] = rgb[x * 3 + 2]; | ||
| rgba_out[x * 4 + 3] = 0xFF; |
There was a problem hiding this comment.
expand_rgb_to_rgba_row is likely on a hot path. Consider iterating via chunks_exact(3) zipped with chunks_exact_mut(4) (or a similar approach) to avoid repeated index multiplications and help the compiler generate a tighter loop.
| for x in 0..width { | |
| rgba_out[x * 4] = rgb[x * 3]; | |
| rgba_out[x * 4 + 1] = rgb[x * 3 + 1]; | |
| rgba_out[x * 4 + 2] = rgb[x * 3 + 2]; | |
| rgba_out[x * 4 + 3] = 0xFF; | |
| for (rgb_px, rgba_px) in rgb[..width * 3] | |
| .chunks_exact(3) | |
| .zip(rgba_out[..width * 4].chunks_exact_mut(4)) | |
| { | |
| rgba_px[0] = rgb_px[0]; | |
| rgba_px[1] = rgb_px[1]; | |
| rgba_px[2] = rgb_px[2]; | |
| rgba_px[3] = 0xFF; |
| if let Some(buf) = rgba.as_deref_mut() { | ||
| let rgba_plane_end = | ||
| one_plane_end | ||
| .checked_mul(4) | ||
| .ok_or(MixedSinkerError::GeometryOverflow { | ||
| width: w, | ||
| height: h, | ||
| channels: 4, | ||
| })?; | ||
| let rgba_plane_start = one_plane_start * 4; | ||
| expand_rgb_to_rgba_row(rgb_row, &mut buf[rgba_plane_start..rgba_plane_end], w); | ||
| } |
There was a problem hiding this comment.
The RGBA-plane bounds/overflow check + slice computation is duplicated across multiple PixelSink impls (Yuv420p/Yuv422p/Yuv444p/Nv12/Nv16/Nv21/Nv24/Nv42). Consider extracting a small helper (e.g., a method that returns the mutable RGBA row slice for [one_plane_start..one_plane_end]) to reduce repetition and the risk of future drift between formats.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| debug_assert!(uv_or_vu.len() >= 2 * width, "chroma row too short"); | ||
| debug_assert!(rgb_out.len() >= width * 3, "rgb_out row too short"); | ||
| let bpp: usize = if ALPHA { 4 } else { 3 }; | ||
| debug_assert!(out.len() >= width * bpp, "out row too short for {bpp}bpp"); |
There was a problem hiding this comment.
The debug_assert! message uses {bpp} inside a string literal, so the actual bpp value will not be interpolated. Consider using a formatted message (e.g., with {} + bpp) so debug builds report the concrete expected stride.
| debug_assert!(out.len() >= width * bpp, "out row too short for {bpp}bpp"); | |
| debug_assert!(out.len() >= width * bpp, "out row too short for {}bpp", bpp); |
Tranche 4b of Ship 8 sink-side RGBA. Adds
Nv24/Nv42(semi-planar 4:4:4) RGBA output via the dual-const-generic<SWAP_UV, ALPHA>template established by PR #17 (NV12 / NV21), and retro-applies a Strategy A combined RGB→RGBA fan-out to all 8 wired families so callers attaching bothwith_rgbandwith_rgbano longer pay the per-pixel YUV→RGB math twice — addresses the Copilot review finding from PR #19 (src/sinker/mixed.rs:1648).Scope
Yuv420pNv12,Nv21Yuv422p,Nv16Yuv444pNv24,Nv42Yuv440pyuv_444_to_rgba_row)Yuv420p9/10/12/14/16,P010/P012/P016Yuv422p9/10/12/14/16,Yuv440p10/12,P210/P212/P216Yuv444p9/10/12/14/16,P410/P412/P416Usage:
```rust
use colconv::{
frame::Nv24Frame,
sinker::MixedSinker,
yuv::{Nv24, nv24_to},
ColorMatrix,
};
let frame = Nv24Frame::new(&y_plane, &uv_plane, w, h, w, 2 * w);
let mut rgb = vec![0u8; (w * h * 3) as usize];
let mut rgba = vec![0u8; (w * h * 4) as usize];
let mut sinker = MixedSinker::::new(w as usize, h as usize)
.with_rgb(&mut rgb)?
.with_rgba(&mut rgba)?;
// Both buffers requested → YUV→RGB math runs once, RGBA derived via the
// Strategy A fan-out (no double per-pixel cost).
nv24_to(&frame, /full_range=/ true, ColorMatrix::Bt709, &mut sinker)?;
```
What's in this PR
Public API
Kernel work — NV24/NV42 RGBA
Mirrors PR #17 (NV12/NV21) shape: dual const generic `<const SWAP_UV: bool, const ALPHA: bool>` on a single shared `nv24_or_nv42_to_rgb_or_rgba_row_impl` kernel per backend, with 4 thin wrappers (NV24 RGB / NV42 RGB / NV24 RGBA / NV42 RGBA) forwarding the 4 `(SWAP_UV, ALPHA)` combinations. The compiler monomorphizes into 4 separate functions; the `if ALPHA` branch and the unused alpha-vector splat are DCE'd at each call site.
Strategy A — combined RGB→RGBA fan-out
Tranches 1–4a wired RGB and RGBA as independent kernel calls — when a caller attached both `with_rgb` and `with_rgba`, `MixedSinker::process` ran the YUV→RGB per-pixel math twice. Copilot review of PR #19 (`src/sinker/mixed.rs:1648`) flagged this.
This PR addresses it by:
Effective memory traffic for the both-buffers case: 3W RGB write + 3W L1-hot read + 4W RGBA write ≈ 7W — same as a hypothetical combined kernel ("Strategy B"), at ~1/10th the LOC.
Strategy B (a third const generic on every kernel doing both stores per pixel) is documented as a future follow-up in `docs/color-conversion-functions.md` § Ship 8 — only worth the LOC cost if profiling later shows the L1-readback step matters.
MixedSinker integration
`with_rgba` / `set_rgba` declared on format-specific impl blocks (per PR #16 safety pattern) — attaching RGBA to a sink that doesn't write it is a compile error rather than a silent stale-buffer bug. The `compile_fail` doctest negative example moved forward from `Nv24` to `Yuv440p` (next not-yet-wired format).
Doc updates
Tests
+16 lib tests on aarch64 (475 vs. 459 in PR #19); per-backend tests on the other 4 SIMD backends fire on their matching CI runners.
Per-backend tests bypass the dispatcher (call each backend's `unsafe nv24_to_rgba_row` / `nv42_to_rgba_row` directly under runtime feature detection) so on AVX-512-capable CI runners all three x86 paths run.
Local results (aarch64 macOS): 475 lib tests + 1 doctest pass; wasm32 + x86_64 cross-targets compile clean.
What's deferred
PixelSinkcontract #2 — kept `pub(crate)` here for consistency with NV12/NV21's existing pattern; should land as a sweep across all `_impl`s) and RGBA-plane bounds-check helper extraction across all 8 `process` impls (Copilot finding feat(yuv420p10): 10-bit YUV 4:2:0 planar → u8 + native u16 RGB #4).Test plan
🤖 Generated with Claude Code